
104 research outputs found

Impact of Biases in Big Data

Author: Glauner Patrick
State Radu
Valtchev Petko
Publication date: 01/01/2018

The underlying paradigm of big data-driven machine learning reflects the desire to derive better conclusions from simply analyzing more data, without the necessity of looking at theory and models. Is simply having more data always helpful? In 1936, The Literary Digest collected 2.3M filled-in questionnaires to predict the outcome of that year's US presidential election. The outcome of this big data prediction proved to be entirely wrong, whereas George Gallup needed only 3K handpicked people to make an accurate prediction. Generally, biases occur in machine learning whenever the distributions of the training set and the test set differ. In this work, we provide a review of different sorts of biases in (big) data sets in machine learning. We provide definitions and discussions of the most commonly appearing biases in machine learning: class imbalance and covariate shift. We also show how these biases can be quantified and corrected. This work is an introductory text for both researchers and practitioners to become more aware of this topic and thus to derive more reliable models for their learning problems.
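
The abstract names two biases, class imbalance and covariate shift, and says they can be quantified, but it does not say how. As a rough illustration only, the sketch below measures class imbalance as a majority-to-minority count ratio and measures covariate shift on a single categorical feature as the total variation distance between the training and test distributions. Both measures, the function names, and the toy data are assumptions made for this example; they are not taken from the paper.

```python
from collections import Counter


def class_imbalance_ratio(labels):
    """Majority-class count divided by minority-class count (1.0 = balanced)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def total_variation_distance(train_values, test_values):
    """Total variation distance between two empirical categorical distributions.

    Returns 0.0 for identical distributions and 1.0 for disjoint support.
    This is one possible shift measure, chosen here for simplicity.
    """
    train_counts = Counter(train_values)
    test_counts = Counter(test_values)
    categories = set(train_counts) | set(test_counts)
    n_train, n_test = len(train_values), len(test_values)
    return 0.5 * sum(
        abs(train_counts[c] / n_train - test_counts[c] / n_test)
        for c in categories
    )


# Hypothetical toy data: the training sample over-represents one region,
# and positive labels are rare.
train_region = ["north"] * 8 + ["south"] * 2
test_region = ["north"] * 5 + ["south"] * 5
train_labels = [1] * 1 + [0] * 9

shift = total_variation_distance(train_region, test_region)
imbalance = class_imbalance_ratio(train_labels)
print(f"covariate shift (TVD) on region: {shift:.2f}")
print(f"class imbalance ratio: {imbalance:.1f}")
```

A larger distance on a feature suggests that the training sample represents the test population less well with respect to that feature.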

arXiv.org e-Print Archive

Open Repository and Bibliography - Luxembourg

On the Reduction of Biases in Big Data Sets for the Detection of Irregular Power Usage

Author: Duarte Diogo
Glauner Patrick
State Radu
Valtchev Petko
Publication date: 01/01/2018

In machine learning, a bias occurs whenever training sets are not representative of the test data, which results in unreliable models. The most common biases in data are arguably class imbalance and covariate shift. In this work, we aim to shed light on this topic in order to increase the overall attention to this issue in the field of machine learning. We propose a scalable novel framework for reducing multiple biases in high-dimensional data sets in order to train more reliable predictors. We apply our methodology to the detection of irregular power usage from real, noisy industrial data. In emerging markets, irregular power usage, and electricity theft in particular, may range up to 40% of the total electricity distributed. Biased data sets are a particular issue in this domain. We show that reducing these biases increases the accuracy of the trained predictors. Our models have the potential to generate significant economic value in a real-world application, as they are being deployed in commercial software for the detection of irregular power usage.

arXiv.org e-Print Archive

Open Repository and Bibliography - Luxembourg

Improving SOA antipatterns detection in Service Based Systems by mining execution traces

Author: Moha Naouel
Nayrolles Mathieu
Valtchev Petko
Publication date: 01/01/2013

Crossref

Archipel - Université du Québec à Montréal

The Challenge of Non-Technical Loss Detection using Artificial Intelligence: A Survey

Author: Bettinger Franck
Glauner Patrick
Meira Jorge Augusto
State Radu
Valtchev Petko
Publication venue: Atlantis Press
Publication date: 01/01/2017

Detection of non-technical losses (NTL), which include electricity theft, faulty meters and billing errors, has attracted increasing attention from researchers in electrical engineering and computer science. NTLs cause significant harm to the economy, as in some countries they may range up to 40% of the total electricity distributed. The predominant research direction is employing artificial intelligence to predict whether a customer causes NTL. This paper first provides an overview of how NTLs are defined and their impact on economies, which includes loss of revenue and profit for electricity providers and decreased stability and reliability of electrical power grids. It then surveys the state-of-the-art research efforts in an up-to-date and comprehensive review of the algorithms, features and data sets used. It finally identifies the key scientific and engineering challenges in NTL detection and suggests how they could be addressed in the future.

arXiv.org e-Print Archive

Directory of Open Access Journals

Open Repository and Bibliography - Luxembourg

Classification of concepts through products of concepts and abstract data types (abstract)

Author: Euzenat Jérôme
Valtchev Petko
Publication venue: No commercial editor.
Publication date: 20/06/1995

The classification scheme formalism, which represents both usual data types and structured objects in a uniform manner, is introduced. It is here provided with a dissimilarity measure which only takes into account the structure of a given domain: a partial order over a set of classes. The measure we define compares a pair of individuals according to their mutual position within the taxonomy structuring the underlying domain. It is then used to design a classification algorithm that works on structured objects.

INRIA a CCSD electronic archive server

Une stratégie de construction de taxonomies dans les objets (A strategy for building taxonomies in objects)

Author: Euzenat Jérôme
Valtchev Petko
Publication venue: HAL CCSD
Publication date: 01/01/1999

Automatically building a class taxonomy from co-defined and indistinguishable objects is not an easy task. Partitioning the set of objects into domains, and ordering these domains hierarchically by the composition relation, makes it possible to differentiate the objects and to avoid certain cycles involving a composition relation. In addition, using a dissimilarity built on the class taxonomies that already exist in some domains avoids having to handle other cycles. Circular references nevertheless remain, and these are confined to a well-identified part of the domains.

INRIA a CCSD electronic archive server

An integrative proximity measure for ontology alignment

Author: Euzenat Jérôme
Valtchev Petko
Publication venue: No commercial editor.
Publication date: 20/10/2003

Integrating heterogeneous resources of the web will require finding agreement between the underlying ontologies. A variety of methods from the literature may be used for this task; basically, they perform pair-wise comparison of entities from each of the ontologies and select the most similar pairs. We introduce a similarity measure that takes advantage of most of the features of OWL-Lite ontologies and integrates many ontology comparison techniques in a common framework. Moreover, we put forth a computation technique to deal with one-to-many relations and circularities in the similarity definitions.

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server

Using FCA to Suggest Refactorings to Correct Design Defects

Author: El Boussaidi Ghizlane
Moha Naouel
Rezgui Jihene
Guéhéneuc Yann-Gaël
Valtchev Petko
Publication venue: Lecture Notes in Computer Science (LNCS), Springer
Publication date: 01/11/2006

Design defects are poor design choices resulting in hard-to-maintain software, hence their detection and correction are key steps of a disciplined software process aimed at yielding high-quality software artifacts. While modern structure- and metric-based techniques enable precise detection of design defects, the correction of the discovered defects, e.g., by means of refactorings, remains a manual, hence error-prone, activity. As many of the refactorings amount to re-distributing class members over a (possibly extended) set of classes, formal concept analysis (FCA) has been successfully applied in the past as a formal framework for refactoring exploration. Here we propose a novel approach for defect removal in object-oriented programs that combines the effectiveness of metrics with the theoretical strength of FCA. A case study of a specific defect, the Blob, drawn from the Azureus project illustrates our approach.

Archipel - Université du Québec à Montréal

Is Big Data Sufficient for a Reliable Detection of Non-Technical Losses?

Author: Bettinger Franck
Glauner Patrick
Meira Jorge
Migliosi Angelo
State Radu
Valtchev Petko
Publication date: 25/07/2017

Non-technical losses (NTL) occur during the distribution of electricity in power grids and include, but are not limited to, electricity theft and faulty meters. In emerging countries, they may range up to 40% of the total electricity distributed. In order to detect NTLs, machine learning methods are used that learn irregular consumption patterns from customer data and inspection results. The Big Data paradigm followed in modern machine learning reflects the desire to derive better conclusions from simply analyzing more data, without the necessity of looking at theory and models. However, the sample of inspected customers may be biased, i.e. it does not represent the population of all customers. As a consequence, machine learning models trained on these inspection results are biased as well and therefore lead to unreliable predictions of whether customers cause NTL or not. In machine learning, this issue is called covariate shift and has not yet been addressed in the literature on NTL detection. In this work, we present a novel framework for quantifying and visualizing covariate shift. We apply it to a commercial data set from Brazil that consists of 3.6M customers and 820K inspection results. We show that some features have a stronger covariate shift than others, making predictions less reliable. In particular, previous inspections were focused on certain neighborhoods or customer classes and were not sufficiently spread among the population of customers. This framework is about to be deployed in a commercial product for NTL detection.

Comment: Proceedings of the 19th International Conference on Intelligent System Applications to Power Systems (ISAP 2017)

arXiv.org e-Print Archive

Crossref

Open Repository and Bibliography - Luxembourg